Defining And Identifying The Roles Of Geographic References Within Text
Abstract
Reliably recognizing, disambiguating, normalizing, storing, and displaying geographic names poses many challenges. However, associating each name with a geographical point location cannot be the final stage. We also need to understand each name’s role within the document, and its association with adjacent text. The paper develops these points through a discussion of two different types of historical texts, both rich in geographic names: descriptive gazetteer entries and travellers’ narratives. It concludes by discussing the limitations of existing mark-up systems in this area.

1 The Great Britain Historical GIS

The Great Britain Historical GIS is a very large assembly of historical information about Britain, all in some sense tied to particular places. The earliest data in the system was computerised in the late 1970s, and it was established as a relational database in 1989-91. Until recently, however, almost our entire content was either statistical or locational: by now, we have computerised or acquired from collaborators a substantial fraction of the information published in the reports of the Censuses of Population for England and Wales, and for Scotland; and of the information published in vital registration reports for the same areas since the 1840s. In general, our coverage ends in the early 1970s, when the relevant information began to be published in digital form. Our statistical database by now comprises over 33m. data values, and is closely linked to digital mapping containing the changing boundaries of the various statistical reporting units: counties, various types of district, and approaching 20,000 parishes. This material has formed the basis for studies of demographic, economic and social change.

However, our largest source of current funding has a different focus. A grant from part of the UK National Lottery is turning the GBH GIS into an on-line resource for ‘life-long learners’, which in practice means the general public.
Our system is not a conventional on-line GIS, as our most obvious audience is people interested in local history. The most basic functionality of our site allows users to specify a location by a place-name or, preferably, a postal code, which in the UK identifies a group of maybe ten houses and therefore a fairly precise location. They will then be taken to a page providing information on how their current local authority (there are 408 in Great Britain) has changed over the last 200 years, with the option to drill deeper by accessing information for the various past administrative units which contained the location they specified. An initial system, limited to the above functions for the modern units, should go live during May 2003: www.VisionOfBritain.org.uk

Later versions of the site will contain vastly greater content. We will provide access to data for the original historical units, using ‘point-in-polygon’ searching of a spatial database to identify relevant units, having first converted the user’s postal code into a geographic coordinate. The site will draw on both vector mapping of historic boundaries and georeferenced image scans of two complete editions of Ordnance Survey One Inch-to-the-Mile maps of Great Britain: the New Popular Edition, published in the late 1940s and the first to include National Grid lines, simplifying geo-referencing all the scanned maps; and the nineteenth century First Series. Geo-referencing the latter will be challenging, and will require extensive ‘rubber sheeting’.

We are computerising two existing inventories of historical administrative units covering England and Wales, and constructing an equivalent digital resource covering Scotland.
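The ‘point-in-polygon’ lookup mentioned above can be illustrated with a minimal Python sketch. It is not the project's implementation, which runs inside a spatial database; it assumes that the postal code has already been converted to a coordinate, and that each unit carries a boundary polygon together with years of creation and abolition. The `Unit` class, the `UNITS` sample data and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Unit:
    """Hypothetical administrative unit with a dated boundary polygon."""
    name: str
    boundary: List[Point]       # polygon vertices, in order
    created: int                # first year the unit existed
    abolished: Optional[int]    # None if the unit still exists


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count boundary crossings of a ray running east from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def units_containing(pt: Point, year: int, units: List[Unit]) -> List[str]:
    """Names of units that existed in `year` and whose boundary contains pt."""
    return [
        u.name
        for u in units
        if u.created <= year
        and (u.abolished is None or year < u.abolished)
        and point_in_polygon(pt, u.boundary)
    ]


# Invented sample data: two square units covering overlapping ground.
UNITS = [
    Unit("Old Parish", [(0, 0), (10, 0), (10, 10), (0, 10)], 1801, 1894),
    Unit("New District", [(0, 0), (20, 0), (20, 20), (0, 20)], 1894, None),
]

print(units_containing((5, 5), 1851, UNITS))
print(units_containing((5, 5), 1901, UNITS))
```

Filtering on dates before testing the geometry keeps the more expensive polygon test to units that could actually apply; a production system would more likely rely on the spatial index of the database itself.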
The inventories are not gazetteers but systematic lists of all the counties, parishes and various kinds of districts that ever existed, with their hierarchic relationships and some information on variant names. This information provides the absolute core of our system, structured as an ontology rather than as a strictly hierarchic thesaurus. The core ontology does not require locational information for units but, where available, locations are stored as polygons representing the boundaries, with dates of creation and abolition.

In our final system, we will be able to offer ‘home pages’ not just for the 408 modern districts but for over 20,000 historic units, including the parishes which were the lowest level of administration until recently, and generally correspond with individual villages. Each home page will contain a map showing the overall location within Britain and a short description generated from the database and highlighting key statistics. From the home page, users will be able to access a more local map showing the unit’s boundary, a range of statistics mostly presented graphically, and information on the unit’s history including boundary changes and hierarchic relationships.

Relative to the overall size and scope of the site, our own capacity to author descriptive and explanatory text is limited. We will concentrate on the text to accompany maps showing national patterns. However, a site that was largely limited to statistics, even presented as maps and graphs, would be pretty boring, and we are therefore computerising a large quantity of text from existing publications. This text forms the main subject of this paper.

There are in fact three types of text. Firstly, we are computerising the introductions to all the census reports between 1801 and 1851, to provide a description of the country as a whole. This aspect of the project is not further discussed here.
Secondly, we are computerising three descriptive gazetteers published in the late nineteenth century, totalling over 4,000 pages and containing about 5m. words:
• John Bartholomew’s Gazetteer of the British Isles (Edinburgh, 1887). This covers the whole British Isles, including Ireland.
• John Goring’s Imperial Gazetteer of England and Wales (Edinburgh, 6 vols., 1870-72).
• Frances Groome’s Ordnance Gazetteer of Scotland (Edinburgh, 6 vols., 1882-85). Our work on this is a collaboration with the Gazetteer for Scotland project.

Even with the gazetteer text, it will be very easy for users to locate information about very specific places, but much harder for them to move around the system to explore the relationships between places, and form a ‘vision of Britain through time’ as a whole. This justifies our third and final type of new content: narratives describing historical journeys around Britain. We are computerising four well-known accounts, as well as some shorter accounts written by radical agitators as they moved around in the mid-19th century:
• William Cobbett, Rural Rides (London, 1830).
• Daniel Defoe, A Journey through the whole island of Britain divided into circuits or journies (London, 1724-7).
• Celia Fiennes, Through England on a side saddle in the time of William and Mary, being the diary of Celia Fiennes (London, 1888).
• Arthur Young, Tours in England and Wales, Selected from the Annals of agriculture (London, 1784-98).

This list has been deliberately kept short, as we almost certainly have the capacity to digitise more books via our Optical Character Recognition system but not necessarily to mark them up. Three obvious additions would be the journals of John Wesley, the Torrington Diaries and Boswell and Johnson’s tour of the Hebrides.

2 Descriptive Gazetteers

The descriptive gazetteers form a very large body of text, but fortunately they are highly structured, making automated parsing feasible.
The parsing software runs within our Oracle database and is written in SQL and PL/SQL. I have no doubt that it would be both more efficient and more effective if it were written in, say, Perl. Little more will be said about the software, other than to note that its relative effectiveness is mainly evidence of the vital importance of having a large database of placenames already built.

Although we are working with three different books, all are written to a broadly similar formula:
• Each consists of alphabetically arranged entries; each entry begins with a head-word, i.e. the place-name, usually in bold or upper case letters.
• The head-word is followed by an indication of the feature type (‘a parish’, ‘a river’, etc).
• Third comes some indication of where the feature is, which almost always indicates a county, sometimes a relative location (‘9 miles SW of Worcester’) and never an absolute location such as latitude and longitude.

The main difference between the books is that Bartholomew’s consists of a very large number of short entries while the Imperial Gazetteer and Groome’s provide longer entries, those for major cities and counties covering several pages. Mostly, however, we focus on the first sentence as outlined above.

Here are some samples, firstly from the very beginning of Bartholomew’s:
• A'an, or Avon, lake, S. Banffshire, among the Cairngorm mountains, 1_ mile long, at alt. of 2250 ft.; it is the head-water of river Avon: which see.
• Aasleagh, place, co. Mayo, 16 m. S. of Westport; P.O.
• Abbas and Temple Combe, par., mid. Somerset, 4 miles S. of Wincanton sta., 1850 ac., pop. 590.
• Abbenhall. See ABENHALL.
• Abberley, par. and seat, W. Worcestershire, 4 miles SW. of Stourport sta., 2636 ac., pop. 605; P.O.
• Abbert, seat, 10 miles NE. of Athenry, co. Galway.
• Abbertoft, hamlet, Willoughby par., mid. Lincolnshire, 2 miles SE. of Alford.
• Abberton.—par., E. Essex, on Roman road, 4 miles S. of Colchester, 1068 ac., pop. 244; P.O.—2. Abberton, par. and seat, E. Worcestershire, on river Piddle, 4 miles NE. of Pershore sta., 1001 ac., pop. 92.
• Abberwick, township, Edlingham par., N. Northumberland, on river Alne, 3 miles W. of Alnwick, 1680 ac., pop. 109.
• Abbess Roding. See ABBOTS ROOTHING.
• Abbethune, seat, 1 m. from Inverkeilor sta., Forfarsh.

Secondly, from the Imperial Gazetteer:
• AFTON, a village 2 miles S of Yarmouth, Isle of Wight. Afton House adjoins it, on a pleasant slope toward the Yar. Afton Down rises in the south-eastern neighbourhood, overhangs the English Channel, has an altitude of about 500 feet, and is crowned by tumuli.
• BINSTEAD, a small village and a parish in the Isle of Wight. The village stands on the coast of the Solent, amid charming environs, 1_ mile W by N of Ryde. The parish comprises 1,140 acres of land and 335 of water; and its post-town is Ryde. Real property, £2,775. Pop., 486. Houses, 105. The manor belonged, at the Conquest, to William Fitz-Stur; and passed to the Bishops of Winchester. Several picturesque villas, one of them belonging to Lord Downes, stand near the village and on the coast. Quarr Abbey House is the seat of Admiral Sir Thomas J. Cochrane. Remains of a Cistertian Abbey, called Quarr Abbey, founded in 1132, by Baldwin de Redvers, afterwards Earl of Devon, stand at a farmstead, 5 furlongs west of the village; and, though fragmentary and mutilated, show some interesting features. A siliceous limestone, containing many fossils, and well suited for building, has been extensively quarried since at least the time of William Rufus. The living is a rectory in the diocese of Winchester. Value, £80.* Patron, the Bishop of Winchester. The church was rebuilt in 1842; is in the early English style; and embodies some sculptured stones of a previous Norman edifice.
• BRAMBLE CHINE, a small ravine on the NW coast of the Isle of Wight; at Colwell bay, 2 miles SW of Yarmouth.
A thick bed of oyster shells, in a fossil state, is here; the shells in the same position as in life, but entirely decomposed.
• CALBOURNE, a village, a parish, and a sub-district in the Isle of Wight. The village stands 5 miles WSW of Newport; and has a post-office under Newport. The parish includes also Newtown borough; and extends from Brixton Down to the Solent. Acres, 6,397; of which 265 are water. Real property, £4,471. Pop., 728. Houses, 145. The property is divided among a few. Westover manor belonged to the Esturs; passed to the Lisles and the Holmeses; and belongs now to the eldest son of Lord Heytesbury, in right of his wife, the daughter of the late Sir Leonard W. Holmes. The house on it is modern; and the grounds are tasteful. Calbourne Bottom, 1_ mile SSW of the village, is a depression between Brixton and Moltestone downs. The living is a rectory, united with the p. curacy of Newtown, in the diocese of Winchester. Value, £675.* Patron, the Bishop of Winchester. The church is early English, much modernized; and has a brass of 1480.—The subdistrict contains eight parishes. Acres, 25,050. Pop., 5,417. Houses, 1,071.
• WIGHT (Isle of), an island in Hants; bounded, on the N, by the Solent,–on the other sides, by the English channel. Its outline is irregularly rhomboidal, and has been compared to that of a turbot, and to that of a bird with expanded wings. Its length from E to W, from Bembridge Point to the Needles, is nearly 23 miles; its greatest breadth from N to S, from West Cowes to St. Catherine’s Point, is 13_ miles; its circuit is about 56 miles; and its area, inclusive of foreshore, is 99,746 acres. The general surface has a considerable elevation above sea-level.
The coast, along the N, is low; around the W angle, is rocky, broken, precipitous, and romantic; and along the SW, the S, and the SE, breaks down in a richly varied series of cliffs, often abrupt or mural, extensively terraced and lofty, including all the magnificent range known as the Undercliff, and everywhere replete with scenic interest. The water-shed uniformly follows the trending of the S coast; and is distant from it never more than 2_ miles, generally less than 1 mile. A range of downs extends about 6 miles from St. Catherine’s Hill to Dunnose; rises from the shore, with excessive steepness, to a height of nearly 800 feet; and is marked, along its steep sea-front, with the picturesque terraces of the Undercliff. A diversified range of downs extends about 22 miles, from the Needles on the W to Culver cliff on the E; commences in grand cliffs about 600 feet high; runs 9 miles nearly due east, in a single, sharp, steep ridge, to Mottiston; attains there its highest altitude, at 662 feet above sea-level; makes several debouches in its subsequent progress; suffers repeated cleaving and disseverment, in the form of gaps or depressions; assumes, for some distance, in the neighbourhood of Carisbrooke, the character of a double or a triple range; is, in some parts of its course, saddle-shaped and slender,–in other parts, broad-based and moundish; and divides the island into two pretty nearly equal sections. A transverse ridge, about 400 feet high, extends about 3 miles in the contiguous to the river Yar; and another transverse ridge, tame in feature, but sometimes of considerable height, extends between the Medina and the Brading. The rest of the surface is either undulating or gently sloping, and has little or no claim to be called picturesque. The chief streams are the Yar, the Newton, the Medina, the Wooton, and the Main or Brading. 
The geognostic structure comprises chiefly lower greensand in most of the S, chalk in part of the centre, and upper eocene in most of the N; but includes many details, possesses deep interest, and may advantageously be studied with the aid of Mantell’s and Martin’s manuals. [just the first paragraph of a long entry]

The second example from the Imperial Gazetteer demonstrates the main reason we need to do a limited amount of work on the whole entry, not just the first sentence. For Binstead, the feature type clause is ‘a small village and a parish’, meaning that the place-name is associated with more than one entity. In extreme cases, a single entry covers four distinct entities, so for example Ledbury in Herefordshire was described as being ‘a small town, a parish, a sub-district and a district’. The last three of these terms are all distinct entities within our ontology, and the entries for such places are in fact divided into a series of sections, each beginning with the type of sub-entry. For Binstead, the first part begins ‘The village’ and is just a single sentence; the second part begins ‘The parish’.

The texts are first scanned by a specialised Optical Character Recognition system optimised for historic materials, operated by our team based at the Centre for Data Digitisation and Analysis at the Queen’s University Belfast. The OCR output is then visually checked and tidied up by Information Technology trainees there, and delivered to the project’s main team as Microsoft Word files replicating the source documents as closely as possible.

The first stage in our work is breaking the text down into the individual entries, each of which is then loaded as a separate record into our database. The way the text so clearly divides into such discrete sections greatly simplifies how we handle it.
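The handling of multi-entity entries described above for Binstead and Ledbury might be sketched as follows in Python. This is an illustration under stated assumptions, not the project's PL/SQL code: it assumes the feature type clause has already been isolated, and that each sub-entry opens with ‘The’ followed by the last word of its type. The function names are hypothetical and the sample body is abridged from the Binstead entry.

```python
import re
from typing import Dict, List


def split_feature_types(clause: str) -> List[str]:
    """Split a clause such as 'a small village and a parish' into its types."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", clause)
    # Drop the leading article from each part and discard empty fragments.
    return [re.sub(r"^(?:a|an)\s+", "", p.strip()) for p in parts if p.strip()]


def split_sections(body: str, types: List[str]) -> Dict[str, str]:
    """Cut the entry body at each 'The <type>' marker, in order of appearance."""
    starts = []
    for t in types:
        noun = t.split()[-1]  # 'small village' -> 'village'
        m = re.search(r"\bThe " + re.escape(noun) + r"\b", body)
        if m:
            starts.append((m.start(), noun))
    starts.sort()
    sections = {}
    for i, (pos, noun) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(body)
        sections[noun] = body[pos:end].strip()
    return sections


# Abridged from the Binstead entry quoted earlier.
body = (
    "The village stands on the coast of the Solent, amid charming environs. "
    "The parish comprises 1,140 acres of land and 335 of water; "
    "and its post-town is Ryde."
)
types = split_feature_types("a small village and a parish")
for noun, text in split_sections(body, types).items():
    print(noun, "->", text)
```

Matching on a capitalised ‘The’ is a deliberate simplification: it avoids cutting at phrases such as ‘near the village’ in the middle of a sentence, but it would miss sub-entries whose marker is spelled differently from the type clause.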
One consequence of this clear division into entries is that, for now, we are not applying any mark-up system to the text itself, other than basic HTML tags to preserve simple formatting, such as bold and italics. Instead, additional structure and search facilities are provided by adding further columns to the table. What follows describes the parsing process for entries from Bartholomew’s:
• Firstly, the end of the head word is identified simply from it being in bold face, and the head word is copied into another column. NB with the gazetteers, identifying the most important geographical name within the text is fairly trivial.
• Entries which are cross-references are identified from their containing specific phrases immediately after the head word: ‘See’, ‘also called XXX, which see’, ‘another name of XXX, which see’ and, at present, ‘Welsh name of XXX, which see’. The system then searches for the cross-referenced name elsewhere in the table.
• The system then tries to identify the feature type by brute force methods: all the strings immediately following the headword in the first two thousand entries were extracted and sorted, and the section identifying the feature type isolated to give 410 distinct type strings. Each of these was then marked up firstly with a version of itself in which all abbreviations were expanded, and secondly with three flags indicating whether the type indicated the entry was for a county, a parish or a borough. Longer term, these ‘original’ feature types will be mapped onto the Alexandria Digital Library Gazetteer Feature Type
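The parsing steps listed above for Bartholomew’s can be sketched in Python. This is only an illustration, since the actual software is written in SQL and PL/SQL inside Oracle. It assumes that the head-word is wrapped in `<b>` tags, that the feature type is followed by a comma, and that type strings are matched by longest prefix against a lookup table; the table shown is a small invented excerpt, and all names are hypothetical.

```python
import re
from typing import Dict, NamedTuple, Optional


class FeatureType(NamedTuple):
    expanded: str      # type string with abbreviations expanded
    is_county: bool
    is_parish: bool
    is_borough: bool


# Invented excerpt standing in for the table of 'original' type strings.
FEATURE_TYPES: Dict[str, FeatureType] = {
    "par.": FeatureType("parish", False, True, False),
    "par. and seat": FeatureType("parish and seat", False, True, False),
    "seat": FeatureType("seat", False, False, False),
    "hamlet": FeatureType("hamlet", False, False, False),
    "township": FeatureType("township", False, False, False),
}

# Assumption: bold face survives as <b>...</b> around the head-word.
HEADWORD = re.compile(r"^<b>(.+?)</b>[.,]?\s*")
CROSS_REF = re.compile(
    r"^(?:See|also called|another name of|Welsh name of)\s+(.+?)"
    r"(?:, which see)?\.?$"
)


class Entry(NamedTuple):
    headword: str
    see: Optional[str]                   # target name if a cross-reference
    feature_type: Optional[FeatureType]  # None if the type is not recognised


def parse_entry(html: str) -> Entry:
    """Extract the head-word, then a cross-reference or a feature type."""
    m = HEADWORD.match(html)
    if m is None:
        raise ValueError("no bold head-word found")
    headword, rest = m.group(1), html[m.end():].strip()
    xref = CROSS_REF.match(rest)
    if xref:
        return Entry(headword, xref.group(1), None)
    # Try longer type strings first so 'par. and seat' wins over 'par.'.
    for key in sorted(FEATURE_TYPES, key=len, reverse=True):
        if rest.startswith(key + ","):
            return Entry(headword, None, FEATURE_TYPES[key])
    return Entry(headword, None, None)


def resolve(see: str, table: Dict[str, Entry]) -> Optional[Entry]:
    """Look up a cross-referenced name in a table keyed by lower-cased head-word."""
    return table.get(see.lower())


entry = parse_entry(
    "<b>Abberley</b>, par. and seat, W. Worcestershire, "
    "4 miles SW. of Stourport sta., 2636 ac., pop. 605; P.O."
)
print(entry.headword, "|", entry.feature_type.expanded)
print(parse_entry("<b>Abbenhall</b>. See ABENHALL.").see)
```

The cross-reference lookup ignores case because the samples give targets in capitals (‘See ABENHALL’) while head-words appear in mixed case.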